
USHIFT-6401, USHIFT-6788: Add fail-fast RBAC bootstrap hook deadlock detection#6471

Open
copejon wants to merge 3 commits into openshift:main from copejon:fix-USHIFT-6401

Conversation


@copejon copejon commented Apr 7, 2026

Summary

  • Add parallel RBAC bootstrap hook deadlock detector to kube-apiserver
  • Detect deadlock condition in ~15 seconds instead of waiting 60 seconds
  • Restart microshift-etcd.scope to recover from deadlock
  • USHIFT-6788: Add wall-clock deadline (rbacHookMaxWaitDuration = 30s) to prevent etcd health flapping from extending detection indefinitely

Problem

MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The rbac/bootstrap-roles PostStartHook uses context.TODO() for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Solution

Add a fail-fast deadlock detector that:

  1. Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
  2. Checks if etcd is healthy while the hook is stuck
  3. If etcd is healthy but the hook has still not completed after 15 seconds, declares a deadlock
  4. Enforces absolute 30s wall-clock deadline regardless of etcd state changes (prevents flapping from extending detection)
  5. Restarts microshift-etcd.scope to force clean recovery

Test plan

  • Start MicroShift fresh - verify normal startup
  • Stop and restart MicroShift - verify no crash loop
  • If deadlock occurs, verify detection in ~15 seconds
  • Verify "RBAC bootstrap hook deadlock detected" log message on recovery
  • Verify wall-clock deadline prevents indefinite flapping extension
  • Verify make verify-go passes
  • Verify make test-unit passes

🤖 Generated with Claude Code via /jira:solve USHIFT-6401, [USHIFT-6788](https://redhat.atlassian.net/browse/USHIFT-6788)

Summary by CodeRabbit

  • Bug Fixes
    • Detects RBAC bootstrap deadlocks during startup by periodically probing readiness; logs the issue and attempts a bounded service restart to recover initialization, improving startup reliability without affecting normal readiness checks.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 7, 2026

openshift-ci-robot commented Apr 7, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Details

In response to this:

Summary

  • Add parallel RBAC bootstrap hook deadlock detector to kube-apiserver
  • Detect deadlock condition in ~15 seconds instead of waiting 60 seconds
  • Restart microshift-etcd.scope to recover from deadlock
  • Pure MicroShift-level fix without vendored code changes

Problem

MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The rbac/bootstrap-roles PostStartHook uses context.TODO() for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Solution

Add a fail-fast deadlock detector that:

  1. Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
  2. Checks if etcd is healthy while the hook is stuck
  3. If etcd is healthy but the hook has still not completed after 15 seconds, declares a deadlock
  4. Restarts microshift-etcd.scope to force clean recovery

Test plan

  • Start MicroShift fresh - verify normal startup
  • Stop and restart MicroShift - verify no crash loop
  • If deadlock occurs, verify detection in ~15 seconds
  • Verify "RBAC bootstrap hook deadlock detected" log message on recovery
  • Verify make verify-go passes
  • Verify make test-unit passes

🤖 Generated with Claude Code via /jira:solve [USHIFT-6401](https://redhat.atlassian.net/browse/USHIFT-6401)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 7, 2026

coderabbitai bot commented Apr 7, 2026

Walkthrough

Adds concurrent RBAC bootstrap hook deadlock detection to the kube-apiserver Run: a detector probes /readyz/poststarthook/rbac/bootstrap-roles and etcd health, signals confirmed deadlock, attempts recovery by stopping microshift-etcd.scope, and causes Run to return a formatted deadlock error.

Changes

Cohort: RBAC Deadlock Detection
File(s): pkg/controllers/kube-apiserver.go
Summary: Adds goroutine detectRBACHookDeadlock and rbacDeadlockDetected channel; extends Run to monitor the signal. Adds timing constants, isEtcdHealthy (short-timeout etcd v3 TLS client), restartMicroshiftEtcdScope (runs systemctl stop microshift-etcd.scope), and deadlock confirmation/counting logic. Imports os/exec.

Sequence Diagram

sequenceDiagram
    participant Run as Run Method
    participant Poll as Readiness Polling
    participant Detector as RBAC Detector
    participant Hook as /readyz/poststarthook/rbac/bootstrap-roles
    participant Etcd as Etcd (v3)
    participant Systemctl as systemctl

    Run->>Poll: start readiness polling
    Run->>Detector: start RBAC deadlock detector goroutine

    Detector->>Detector: wait initial delay
    loop periodic checks
        Detector->>Hook: GET readiness endpoint
        Detector->>Etcd: short-timeout Status probe (TLS)
        Etcd-->>Detector: health response

        alt Etcd healthy & Hook remains not-ready past threshold
            Detector->>Systemctl: run "systemctl stop microshift-etcd.scope"
            Systemctl-->>Detector: result
            Detector->>Run: signal deadlock detected (close channel)
            Run->>Run: log error, attempt recovery, return deadlock error
        else Hook ready or threshold not reached
            Detector->>Detector: continue polling
        end
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (9 passed)
  • Description Check — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check — The title directly and specifically describes the main change: adding RBAC bootstrap hook deadlock detection with fail-fast behavior.
  • Stable And Deterministic Test Names — No Ginkgo test files modified; custom check for stable test names not applicable.
  • Test Structure And Quality — No Ginkgo tests present in PR; custom check not applicable.
  • Microshift Test Compatibility — This PR only modifies pkg/controllers/kube-apiserver.go controller code and does not add any Ginkgo e2e tests.
  • Single Node Openshift (SNO) Test Compatibility — PR modifies pkg/controllers/kube-apiserver.go with RBAC bootstrap hook deadlock detection logic but does not add new Ginkgo e2e tests.
  • Topology-Aware Scheduling Compatibility — PR modifies controller code for kube-apiserver without introducing deployment manifests, pod specs, or scheduling constraints that would affect SNO, Two-Node, or HyperShift clusters.
  • OTE Binary Stdout Contract — Modified file is runtime controller code, not a test binary, so the OTE Binary Stdout Contract does not apply.
  • IPv6 And Disconnected Network Test Compatibility — PR adds only application code to pkg/controllers/kube-apiserver.go with RBAC bootstrap hook deadlock detection logic; no Ginkgo e2e tests added.



Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/apparentlymart/go-cidr@v1.1.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/go-systemd@v0.0.0-20190321100706-95778dfbb74e: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/google/go-cmp@v0.7.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/miekg/dns@v1.1.63: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/openshift/api@v0.0.0-20260309155933-45fd88d185dd: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/openshift/build-machinery-go@v0.0.0-20251023084048-5d77c1a5e5af: is explicitly required in go.mod, but not marked as explicit

... [truncated 29518 characters] ...

belet: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/metrics: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/mount-utils: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/pod-security-admission: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-apiserver: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-cli-plugin: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-controller: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"



Comment @coderabbitai help to get the list of available commands and usage tips.


openshift-ci bot commented Apr 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Apr 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 7, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controllers/kube-apiserver.go (1)

542-557: Function name is misleading.

restartMicroshiftEtcdScope only stops the scope; the restart occurs when MicroShift itself restarts after Run() returns an error. Consider renaming to stopMicroshiftEtcdScope for clarity.

♻️ Suggested rename
-// restartMicroshiftEtcdScope restarts the microshift-etcd.scope to recover from deadlock.
-// This forces a clean restart of etcd which can help break the circular dependency.
-func restartMicroshiftEtcdScope() error {
+// stopMicroshiftEtcdScope stops the microshift-etcd.scope to recover from deadlock.
+// MicroShift will restart both etcd and kube-apiserver after Run() returns an error.
+func stopMicroshiftEtcdScope() error {
 	klog.Info("Stopping microshift-etcd.scope for recovery")

Also update the call site at line 438.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/kube-apiserver.go` around lines 542 - 557, The function
restartMicroshiftEtcdScope is misnamed because it only stops the scope; rename
the function to stopMicroshiftEtcdScope (and update its comment and klog
messages accordingly) and update every call site that invokes
restartMicroshiftEtcdScope to call stopMicroshiftEtcdScope instead; ensure
exported/internal visibility remains unchanged and run tests/build to verify no
references remain to the old name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 488-496: The current RBAC bootstrap loop resets checkCount to 0
when etcd is unhealthy which allows flapping to prevent ever reaching maxChecks;
modify the logic in the kube-apiserver.go RBAC detection loop (around variables
checkCount, maxChecks, and etcdHealthy) to track a wall-clock deadline (e.g.,
startTime or deadline) when the check begins and do not reset that deadline on
etcd state changes—always compare time.Since(startTime) (or
time.Now().After(deadline)) against a configured maxWaitDuration and fail/exit
the loop if exceeded; keep the existing checkCount logic for deadlock detection
but add this independent timeout check so flapping cannot extend detection
indefinitely.

---

Nitpick comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 542-557: The function restartMicroshiftEtcdScope is misnamed
because it only stops the scope; rename the function to stopMicroshiftEtcdScope
(and update its comment and klog messages accordingly) and update every call
site that invokes restartMicroshiftEtcdScope to call stopMicroshiftEtcdScope
instead; ensure exported/internal visibility remains unchanged and run
tests/build to verify no references remain to the old name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8e1dd33d-45f9-439d-91af-e5c44c9150fc

📥 Commits

Reviewing files that changed from the base of the PR and between 2cc28ca and db029bd.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go


copejon commented Apr 7, 2026

@coderabbitai help


coderabbitai bot commented Apr 7, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 469-472: The loop that decides to signal deadlock is incrementing
checkCount and falling through to close(deadlockDetected) on
rbacHookMaxWaitDuration expiry even when isEtcdHealthy() or the RBAC probe
errored; modify the loop (the block around checkCount, isEtcdHealthy,
rbacHookMaxWaitDuration, maxChecks, and close(deadlockDetected)) so that
checkCount is only incremented when the "healthy-etcd stall" predicate is
actually observed (i.e., when both the RBAC probe is unfinished AND
isEtcdHealthy() returns true), ignore/skip increments when probes or health
checks error, and on hitting rbacHookMaxWaitDuration return early unless
checkCount has already reached maxChecks; ensure close(deadlockDetected) is only
called after the predicate has been confirmed maxChecks times.
- Around line 483-486: The RBAC readiness probe call using
restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx)...
uses the long-lived controller ctx and can hang; wrap that request in a short
per-probe context with timeout (e.g., context.WithTimeout(ctx, 2*time.Second)),
use the derived probeCtx in Do(probeCtx), and defer cancel() so the goroutine
unblocks and the 2s interval / 30s max-wait logic can re-evaluate; update the
call that sets status and err (StatusCode(&status).Error()) to use the probe
context.
- Around line 558-563: The call to stopCmd.CombinedOutput() in
restartMicroshiftEtcdScope is unbounded and can hang; wrap the systemctl
invocation with a context deadline (e.g., create a context.WithTimeout and use
exec.CommandContext) and use that context when constructing stopCmd, then read
CombinedOutput (or CombinedOutput on the context-backed command) and return a
timeout-aware error if the context expires; reference
restartMicroshiftEtcdScope, stopCmd and CombinedOutput and ensure you cancel the
context and propagate/annotate context timeout errors in the returned fmt.Errorf
message.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 6d409d90-afaf-448d-81e1-02888bc827c6

📥 Commits

Reviewing files that changed from the base of the PR and between 2cc28ca and 6de022e.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@copejon copejon changed the title USHIFT-6401: Add fail-fast RBAC bootstrap hook deadlock detection USHIFT-6401, USHIFT-6788: Add fail-fast RBAC bootstrap hook deadlock detection Apr 8, 2026

openshift-ci-robot commented Apr 8, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

This pull request references USHIFT-6788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add parallel RBAC bootstrap hook deadlock detector to kube-apiserver
  • Detect deadlock condition in ~15 seconds instead of waiting 60 seconds
  • Restart microshift-etcd.scope to recover from deadlock
  • USHIFT-6788: Add wall-clock deadline (rbacHookMaxWaitDuration = 30s) to prevent etcd health flapping from extending detection indefinitely

Problem

MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The rbac/bootstrap-roles PostStartHook uses context.TODO() for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Solution

Add a fail-fast deadlock detector that:

  1. Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
  2. Checks if etcd is healthy while the hook is stuck
  3. If etcd is healthy but the hook has still not completed after 15 seconds, declares a deadlock
  4. Enforces absolute 30s wall-clock deadline regardless of etcd state changes (prevents flapping from extending detection)
  5. Restarts microshift-etcd.scope to force clean recovery

Test plan

  • Start MicroShift fresh - verify normal startup
  • Stop and restart MicroShift - verify no crash loop
  • If deadlock occurs, verify detection in ~15 seconds
  • Verify "RBAC bootstrap hook deadlock detected" log message on recovery
  • Verify wall-clock deadline prevents indefinite flapping extension
  • Verify make verify-go passes
  • Verify make test-unit passes

🤖 Generated with Claude Code via /jira:solve [USHIFT-6401](https://redhat.atlassian.net/browse/USHIFT-6401), [USHIFT-6788](https://redhat.atlassian.net/browse/USHIFT-6788)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@copejon copejon marked this pull request as ready for review April 16, 2026 14:12
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2026
@openshift-ci openshift-ci bot requested review from eslutsky and pacevedom April 16, 2026 14:13

openshift-ci-robot commented Apr 16, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

This pull request references USHIFT-6788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add parallel RBAC bootstrap hook deadlock detector to kube-apiserver
  • Detect deadlock condition in ~15 seconds instead of waiting 60 seconds
  • Restart microshift-etcd.scope to recover from deadlock
  • USHIFT-6788: Add wall-clock deadline (rbacHookMaxWaitDuration = 30s) to prevent etcd health flapping from extending detection indefinitely

Problem

MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The rbac/bootstrap-roles PostStartHook uses context.TODO() for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Solution

Add a fail-fast deadlock detector that:

  1. Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
  2. Checks if etcd is healthy while the hook is stuck
  3. If etcd is healthy but the hook has still not completed after 15 seconds, declares a deadlock
  4. Enforces absolute 30s wall-clock deadline regardless of etcd state changes (prevents flapping from extending detection)
  5. Restarts microshift-etcd.scope to force clean recovery

Test plan

  • Start MicroShift fresh - verify normal startup
  • Stop and restart MicroShift - verify no crash loop
  • If deadlock occurs, verify detection in ~15 seconds
  • Verify "RBAC bootstrap hook deadlock detected" log message on recovery
  • Verify wall-clock deadline prevents indefinite flapping extension
  • Verify make verify-go passes
  • Verify make test-unit passes

🤖 Generated with Claude Code via /jira:solve [USHIFT-6401](https://redhat.atlassian.net/browse/USHIFT-6401), [USHIFT-6788](https://redhat.atlassian.net/browse/USHIFT-6788)

Summary by CodeRabbit

  • Bug Fixes
    • Introduced a detection mechanism for RBAC bootstrap initialization deadlocks with automatic recovery. The system now monitors initialization health and attempts a service restart when deadlock conditions are identified, improving overall reliability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
pkg/controllers/kube-apiserver.go (1)

468-515: ⚠️ Potential issue | 🟠 Major

Wall-clock deadline break still triggers false deadlock signal.

When rbacHookMaxWaitDuration expires at line 472, the loop breaks and falls through to close(deadlockDetected) at line 515—even if etcd was never confirmed healthy. The "etcd healthy + hook stuck" predicate may never have been observed.

Additionally, checkCount++ at line 481 executes before conditions are verified. If isEtcdHealthy errors, the count advances without observing the deadlock predicate.

Proposed fix
+	deadlockPredicateObserved := false
+
 	for checkCount < maxChecks {
 		// Check absolute deadline first - this cannot be reset by etcd state changes
 		if time.Since(startTime) >= rbacHookMaxWaitDuration {
 			klog.Errorf("RBAC bootstrap hook exceeded maximum wait duration of %v", rbacHookMaxWaitDuration)
-			break
+			if !deadlockPredicateObserved {
+				klog.Warning("Wall-clock deadline reached but deadlock predicate never confirmed - not signaling deadlock")
+				return
+			}
+			break // predicate was observed, signal deadlock
 		}
 
 		select {
 		case <-ctx.Done():
 			return
 		case <-time.After(rbacHookCheckInterval * time.Second):
 		}
 
-		checkCount++
-
 		// Check RBAC hook status
 		var status int
 		err := restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx).StatusCode(&status).Error()
 
 		// If hook is ready, no deadlock
 		if err == nil && status == 200 {
 			klog.V(4).Info("RBAC bootstrap hook completed successfully")
 			return
 		}
 
 		// Hook not ready - check if etcd is healthy
 		etcdHealthy, etcdErr := isEtcdHealthy(ctx)
 		if etcdErr != nil {
 			klog.V(4).Infof("Could not check etcd health: %v", etcdErr)
 			continue
 		}
 
 		if etcdHealthy {
+			checkCount++
+			deadlockPredicateObserved = true
 			klog.Warningf("RBAC bootstrap hook not ready (check %d/%d, elapsed %v), but etcd is healthy - potential deadlock",
 				checkCount, maxChecks, time.Since(startTime).Round(time.Second))
 		} else {
 			// etcd not healthy - not a deadlock, just waiting for etcd
 			klog.V(4).Infof("RBAC hook waiting, etcd not yet healthy (check %d/%d)", checkCount, maxChecks)
-			// Reset counter since this isn't a deadlock condition
-			// Note: wall-clock deadline (startTime) is NOT reset - flapping cannot extend indefinitely
 			checkCount = 0
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/kube-apiserver.go` around lines 468 - 515, The loop currently
can fall through to close(deadlockDetected) on wall-clock timeout even if etcd
was never observed healthy and also increments checkCount before confirming etcd
health; fix by tracking whether etcd has ever been observed healthy (e.g., add
sawEtcdHealthy bool set to true when isEtcdHealthy(ctx) returns (true, nil)),
only increment checkCount when etcdHealthy==true and etcdErr==nil (move the
checkCount++ after the etcd health success path), and after the loop only signal
deadlock (close(deadlockDetected)) if sawEtcdHealthy is true and the RBAC hook
still isn’t ready; leave the wall-clock timeout to break the loop but not
unconditionally declare a deadlock. Ensure references:
restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(...).StatusCode(&status).Error(),
isEtcdHealthy(ctx), checkCount, sawEtcdHealthy, and deadlockDetected.
🧹 Nitpick comments (1)
pkg/controllers/kube-apiserver.go (1)

566-568: Consider context-aware wait.

The 1s time.Sleep isn't interruptible. Low impact but a select with ctx.Done() would be cleaner for graceful shutdown.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/kube-apiserver.go` around lines 566 - 568, Replace the
uninterruptible time.Sleep(1 * time.Second) with a context-aware wait so
shutdown can be cancelled: where the sleep occurs (the time.Sleep call in
kube-apiserver.go cleanup/wait block), use a select that waits on time.After(1 *
time.Second) and also listens on ctx.Done(), returning early if ctx is
cancelled; ensure you reference the existing ctx variable and keep the same
timeout duration.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 8aa2f8a5-61b0-40e5-a4e9-2c35b99a9835

📥 Commits

Reviewing files that changed from the base of the PR and between 2cc28ca and 6de022e.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@copejon
Contributor Author

copejon commented Apr 16, 2026

/retest

@openshift-ci-robot

openshift-ci-robot commented Apr 17, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

This pull request references USHIFT-6788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add parallel RBAC bootstrap hook deadlock detector to kube-apiserver
  • Detect deadlock condition in ~15 seconds instead of waiting 60 seconds
  • Restart microshift-etcd.scope to recover from deadlock
  • USHIFT-6788: Add wall-clock deadline (rbacHookMaxWaitDuration = 30s) to prevent etcd health flapping from extending detection indefinitely

Problem

MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The rbac/bootstrap-roles PostStartHook uses context.TODO() for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Solution

Add a fail-fast deadlock detector that:

  1. Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
  2. Checks if etcd is healthy while the hook is stuck
  3. If etcd is healthy but hook not completing for 15 seconds, declares deadlock
  4. Enforces absolute 30s wall-clock deadline regardless of etcd state changes (prevents flapping from extending detection)
  5. Restarts microshift-etcd.scope to force clean recovery

Test plan

  • Start MicroShift fresh - verify normal startup
  • Stop and restart MicroShift - verify no crash loop
  • If deadlock occurs, verify detection in ~15 seconds
  • Verify "RBAC bootstrap hook deadlock detected" log message on recovery
  • Verify wall-clock deadline prevents indefinite flapping extension
  • Verify make verify-go passes
  • Verify make test-unit passes

🤖 Generated with Claude Code via /jira:solve [USHIFT-6401](https://redhat.atlassian.net/browse/USHIFT-6401), [USHIFT-6788](https://redhat.atlassian.net/browse/USHIFT-6788)

Summary by CodeRabbit

  • Bug Fixes
  • Added runtime detection for RBAC bootstrap deadlocks with automatic recovery: the system now detects stuck initialization, logs an error, and attempts a bounded service restart to recover initialization and improve startup reliability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

copejon and others added 3 commits April 17, 2026 18:37
…tion

On MicroShift restart, the RBAC bootstrap hook can deadlock when etcd
contains existing data. The hook uses context.TODO() for API calls,
which has no timeout. When the loopback client hangs, this creates a
circular dependency where the hook waits for the API server while the
API server waits for the hook to complete.

This change adds a parallel deadlock detector that:
- Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
- Checks if etcd is healthy while the hook is stuck
- Detects deadlock in ~15 seconds instead of waiting 60 seconds
- Restarts microshift-etcd.scope to recover from the deadlock

This breaks the crash loop by detecting the condition early and taking
recovery action at the MicroShift level, without requiring changes to
vendored upstream Kubernetes code.

Related upstream issues: kubernetes/kubernetes#86715, #97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add rbacHookMaxWaitDuration (30s) as an absolute deadline that cannot
be reset by etcd health state changes. This prevents a flapping etcd
from extending the deadlock detection indefinitely.

The existing checkCount logic is preserved for detecting deadlock when
etcd is consistently healthy, but the wall-clock deadline provides a
hard upper bound regardless of etcd state transitions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…itives

- Only increment checkCount when deadlock predicate confirmed
  (RBAC not ready AND etcd healthy)
- Skip counting when RBAC probe or etcd health check errors
- On wall-clock timeout, only trigger recovery if checkCount >= maxChecks
- Add 1s timeout to RBAC probe to prevent hanging on unresponsive API
- Add 5s timeout to systemctl stop to prevent recovery path from stalling
- Extract rbacHookPollDelayStart constant for clarity

Prevents false positive deadlock detection when etcd flaps or probes
error, ensuring close(deadlockDetected) only fires after confirming
the deadlock condition the required number of times.

Related: kubernetes/kubernetes#86715, #97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 59-69: rbac timing constants rbacHookDeadlockTimeout and
rbacHookCheckInterval are plain integers but used as time.Duration; change them
to durations by multiplying by time.Second (rbacHookDeadlockTimeout = 15 *
time.Second and rbacHookCheckInterval = 2 * time.Second), then update the
log/format that prints the timeout to use %s (duration) instead of “%d seconds”,
and remove any places that multiply rbacHookCheckInterval by time.Second (e.g.
the spot that computes maxChecks / or multiplies the interval) so all arithmetic
uses time.Duration consistently.
- Around line 496-515: The current RBAC probe logic treats any non-nil err from
restClient.Get() as a transport error and skips deadlock counting; change the
condition so only requests with no HTTP status code (transport/timeout errors)
are skipped — i.e., update the check around restClient.Get()/probeCtx to
skip/continue only when err != nil AND status == 0 (use the existing status
integer and err from the Do(...).StatusCode(&status).Error() call), while
allowing HTTP responses (even 5xx where status != 0) to be counted toward the
deadlock check and preserve the existing klog message for the skipped case.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 7ca5fc05-cf7b-4e48-970f-e7400f1307ef

📥 Commits

Reviewing files that changed from the base of the PR and between 7daee5d and 184e1cb.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

Comment on lines +59 to +69
// rbacHookDeadlockTimeout is the time to wait for the RBAC bootstrap hook
// before declaring a deadlock. This is shorter than kubeAPIStartupTimeout
// to allow for faster recovery.
rbacHookDeadlockTimeout = 15
// rbacHookCheckInterval is how often to check the RBAC hook status
rbacHookPollDelayStart = 5 * time.Second
rbacHookCheckInterval = 2
// rbacHookMaxWaitDuration is the absolute maximum time to wait for the RBAC hook
// regardless of etcd health state changes. This prevents flapping from extending
// detection indefinitely.
rbacHookMaxWaitDuration = 30 * time.Second
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Expected current result: negative maxChecks. Expected fixed result: 5.
python - <<'PY'
second = 1_000_000_000
n = 15 - 5 * second
go_trunc_div = -(abs(n) // 2)
print("current maxChecks:", go_trunc_div)
print("fixed maxChecks:", (15 * second - 5 * second) // (2 * second))
PY

rg -n 'rbacHookDeadlockTimeout|rbacHookPollDelayStart|rbacHookCheckInterval|maxChecks' pkg/controllers/kube-apiserver.go

Repository: openshift/microshift

Length of output: 1035


Fix type mismatch in RBAC timing constants.

Lines 62 and 65 define integer values but are used in Duration arithmetic at line 469, producing maxChecks = -2,499,999,992. This causes the deadlock check to trigger immediately, breaking RBAC bootstrap detection.

Add * time.Second to lines 62 and 65:

  • rbacHookDeadlockTimeout = 15 * time.Second
  • rbacHookCheckInterval = 2 * time.Second

Then update line 446 to use %s format instead of %d seconds, and remove the * time.Second conversion from line 493 since rbacHookCheckInterval becomes a Duration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/kube-apiserver.go` around lines 59 - 69, rbac timing
constants rbacHookDeadlockTimeout and rbacHookCheckInterval are plain integers
but used as time.Duration; change them to durations by multiplying by
time.Second (rbacHookDeadlockTimeout = 15 * time.Second and
rbacHookCheckInterval = 2 * time.Second), then update the log/format that prints
the timeout to use %s (duration) instead of “%d seconds”, and remove any places
that multiply rbacHookCheckInterval by time.Second (e.g. the spot that computes
maxChecks / or multiplies the interval) so all arithmetic uses time.Duration
consistently.

Comment on lines +496 to +515
// Check RBAC hook status
probeCtx, cancel := context.WithTimeout(ctx, time.Second)
var status int
err := restClient.Get().
AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").
Do(probeCtx).
StatusCode(&status).
Error()
cancel()

// If hook is ready, no deadlock
if err == nil && status == 200 {
klog.V(4).Info("RBAC bootstrap hook completed successfully")
return
}

// If RBAC probe errored, skip this iteration (don't count toward deadlock)
if err != nil {
klog.V(4).Infof("RBAC probe error (not counting toward deadlock): %v", err)
continue

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n 'k8s.io/client-go' go.mod

modcache="$(go env GOPATH 2>/dev/null)/pkg/mod"
rg -n -C4 'func \(r Result\) (Error|StatusCode)\(' "$modcache"/k8s.io/client-go@* 2>/dev/null || true

Repository: openshift/microshift

Length of output: 212


🏁 Script executed:

cat -n pkg/controllers/kube-apiserver.go | sed -n '490,520p'

Repository: openshift/microshift

Length of output: 1110


🏁 Script executed:

find . -path ./deps -prune -o -type f -name "*.go" -print | xargs grep -l "type Result" 2>/dev/null | head -5

Repository: openshift/microshift

Length of output: 351


🏁 Script executed:

# Look for the local client-go implementation
find ./deps -type f -name "*.go" -path "*client-go*" | grep -i result | head -5

Repository: openshift/microshift

Length of output: 698


🏁 Script executed:

# Search for Result type and Error method definition in the local client-go
rg -A10 "type Result struct" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/ 2>/dev/null | head -30

Repository: openshift/microshift

Length of output: 1171


🏁 Script executed:

rg -A5 "func \(r \*Result\) Error\(\)" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift

Length of output: 46


🏁 Script executed:

rg -A5 "func \(r \*Result\) StatusCode\(" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift

Length of output: 46


🏁 Script executed:

# Look at the Do() method to understand what happens
rg -B5 -A15 "func \(r \*Request\) Do\(ctx" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -50

Repository: openshift/microshift

Length of output: 742


🏁 Script executed:

rg -A20 "func \(r \*Request\) transformResponse" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -60

Repository: openshift/microshift

Length of output: 1126


🏁 Script executed:

# Search for StatusCode and Error methods more broadly
grep -n "func.*Result.*StatusCode\|func.*Result.*Error" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift

Length of output: 158


🏁 Script executed:

# Get a broader view of Result methods
rg -n "^func \(r.*Result\)" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift

Length of output: 486


🏁 Script executed:

sed -n '1433,1448p' ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift

Length of output: 664


🏁 Script executed:

sed -n '1489,1515p' ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift

Length of output: 1043


🏁 Script executed:

# Also check transformResponse to see if statusCode is populated for error responses
rg -A50 "func \(r \*Request\) transformResponse" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | grep -A50 "resp.StatusCode"

Repository: openshift/microshift

Length of output: 46


🏁 Script executed:

# Let's see more of transformResponse to understand when err is set
sed -n '1269,1350p' ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift

Length of output: 3859


🏁 Script executed:

# Check where statusCode is assigned in transformResponse
rg -n "statusCode.*=" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -10

Repository: openshift/microshift

Length of output: 151


🏁 Script executed:

# Search for where statusCode is assigned in transformResponse
rg -B5 -A5 "result.statusCode\|statusCode:" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -80

Repository: openshift/microshift

Length of output: 46


🏁 Script executed:

# Get the full transformResponse implementation
rg -A100 "^func \(r \*Request\) transformResponse\(ctx context.Context" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -150

Repository: openshift/microshift

Length of output: 3722


HTTP error responses should be distinguished from transport errors.

When the server returns HTTP 5xx status codes, transformResponse() sets both statusCode and err fields in the Result. The current condition at line 513 treats this the same as network/timeout errors and skips the iteration. This prevents the deadlock detector from counting the "hook not ready" state.

Modify the condition to skip only when there's no HTTP status code (transport/timeout error):

Proposed fix
-		err := restClient.Get().
+		result := restClient.Get().
 			AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").
 			Do(probeCtx).
-			StatusCode(&status).
-			Error()
+			StatusCode(&status)
+		err := result.Error()
 		cancel()
 
 		// If hook is ready, no deadlock
 		if err == nil && status == 200 {
 			klog.V(4).Info("RBAC bootstrap hook completed successfully")
 			return
 		}
 
-		// If RBAC probe errored, skip this iteration (don't count toward deadlock)
-		if err != nil {
+		// If no HTTP status was received, skip this iteration (transport/timeout).
+		if err != nil && status == 0 {
 			klog.V(4).Infof("RBAC probe error (not counting toward deadlock): %v", err)
 			continue
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/kube-apiserver.go` around lines 496 - 515, The current RBAC
probe logic treats any non-nil err from restClient.Get() as a transport error
and skips deadlock counting; change the condition so only requests with no HTTP
status code (transport/timeout errors) are skipped — i.e., update the check
around restClient.Get()/probeCtx to skip/continue only when err != nil AND
status == 0 (use the existing status integer and err from the
Do(...).StatusCode(&status).Error() call), while allowing HTTP responses (even
5xx where status != 0) to be counted toward the deadlock check and preserve the
existing klog message for the skipped case.

@openshift-ci
Contributor

openshift-ci bot commented Apr 18, 2026

@copejon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-tests-arm 184e1cb link true /test e2e-aws-tests-arm
ci/prow/ocp-full-conformance-rhel-eus 184e1cb link true /test ocp-full-conformance-rhel-eus
ci/prow/ocp-full-conformance-serial-rhel-eus 184e1cb link true /test ocp-full-conformance-serial-rhel-eus
ci/prow/e2e-aws-tests-bootc-arm-el9 184e1cb link true /test e2e-aws-tests-bootc-arm-el9
ci/prow/verify 184e1cb link true /test verify
ci/prow/e2e-aws-tests-bootc-arm-el10 184e1cb link true /test e2e-aws-tests-bootc-arm-el10
ci/prow/e2e-aws-tests 184e1cb link true /test e2e-aws-tests
ci/prow/e2e-aws-tests-bootc-el9 184e1cb link true /test e2e-aws-tests-bootc-el9
ci/prow/e2e-aws-tests-bootc-el10 184e1cb link true /test e2e-aws-tests-bootc-el10

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
